During an interview, Candr CEO Chris Dicker reels off the names of the bots scraping the company’s flagship site, Trusted Reviews, and stealing its content.
“Oxy Labs, Gina AI, Zen Rose, Scrape AI, Zite, Scrapeless, Scrape Stack, Hyper Browser, Exa, Bright Data. I mean, there’s loads of them.”
He works with Cloudflare and Tollbit to control and catalogue the bot attacks, but still they come. The aforementioned companies appear to be third-party scrapers – effectively content traffickers – who then sell content on to LLMs.
Trusted Reviews is also being scraped regularly by bots directly representing the main LLM companies including OpenAI, Google and Meta.
Having exhausted all other avenues, Dicker is among the publishers adopting a new tactic to protect their businesses: adding search-only contracts to website terms and conditions.
Trusted Reviews now includes a notice on its website which tells the LLMs “access to the website is governed by a contractual payment obligation”. This blanket approach replaces previous robots.txt notices asking various named bots to keep out of the site.
Taking down names and cataloguing abuse
For some months Dicker has been taking names and details of the bots that continue to ignore this notice (all of them apparently). And he could send the companies concerned thousands of invoices for £500 per article with the backstop of enforcing debt recovery through the UK small-claims court system.
Some tech companies believe there is no copyright in facts and they are already spending millions in the courts to prove this point. The search-only contracts created by the Movement for an Open Web sidestep this thorny issue.
Dicker said: “This is about terms and conditions law, which ultimately has been in place globally since the Roman times. This is a different route to go down, where there’s less ability to argue kind of nuance on it.”
Trusted Reviews has been around since 2003 and has become a major authority on consumer technology.
The bedrock of its content is thorough, in-depth product reviews, which can run to 5,000 words and are based on weeks of testing. Around 18 months ago Dicker realised this content was seen as “incredibly valuable” by AI companies, which began helping themselves to it on a massive scale.
He said: “We came in to find our site all of a sudden coming down… We found out that OpenAI were hitting our site 1.6 million times in a day.
“And despite us having a robots.txt notice in place, telling them that they shouldn’t be accessing our site, they were just ignoring it and coming straight through.”
The site was taken down several times and Candr was told that its hosting costs would have to go up because of the additional load on servers.
The site’s content was being stolen “at a ridiculous pace” and Candr was being sent a bill to cover the costs.
Dicker said: “We reached out to OpenAI on all their emails and across LinkedIn. I even managed to get a couple of their people’s mobile numbers from contacts of mine.
“I called them, left messages on their mobile, and heard absolutely nothing from them. When I did finally manage to corner one of them at an event and told them what had happened, they said, ‘oh, you should just reach out to us’. And I was like, ‘that’s interesting, because I did’.
“’I even called your mobile number numerous times and didn’t hear anything back’ and they just ran off as quickly as possible.”
On the day Trusted Reviews was hit 1.6 million times by OpenAI’s bots, apparently taking information in response to user queries, some 300 actual website clicks were delivered by ChatGPT. Dicker said: “There is no value exchange.”
The current rate of bot website visits to Trusted Reviews is around 200,000 per day.
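To put the 1.6 million-hit day in proportion, the short Python sketch below works out the ratio of bot requests to referred clicks. It uses only the two figures Dicker quotes and treats “some 300” clicks as exactly 300, which is an assumption; the variable names are illustrative.

```python
# Figures quoted by Dicker for a single day on Trusted Reviews.
bot_hits = 1_600_000     # requests attributed to OpenAI's bots
referral_clicks = 300    # clicks delivered by ChatGPT ("some 300", assumed exact)

# How many bot requests were made for each visitor sent back to the site.
hits_per_click = bot_hits / referral_clicks

# The same relationship expressed as a percentage of bot requests.
click_rate_percent = referral_clicks / bot_hits * 100

print(f"{hits_per_click:,.0f} bot requests per referred click")
print(f"{click_rate_percent:.3f}% of bot requests matched by a click")
```

On those assumptions, the site received roughly 5,300 bot requests for every reader ChatGPT sent its way.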
Huge investment in content which is being stolen by LLMs
Candr invests a great deal in creating its content, which in turn attracts strong advertising from global tech brands keen to reach eight million readers who are actively seeking buying advice. It runs two testing facilities and employs 25 full-time staff and around 100 part-time and freelance staff.
All its testing is done in-house, something Dicker says is “incredibly expensive to do”.
The impact of the LLMs on traffic has been “horrendous”, Dicker said, mirroring a pattern which has been seen across most leading publisher websites in the UK and US.
He said that, for anyone publishing evergreen content, the LLMs scrape it once or twice and then look to replace the website they took it from.
“We’ve had to relook at our content strategy… So if you look at how-to content, explainer guides, things that would sit on the site and be evergreen pieces of content… that just disappears now.”
Dicker said that everyone agrees the AI companies are “breaking the law” by “pulling off the biggest copyright theft in history”.
But he added: “The creators aren’t able to afford to hold these companies to the law of the land… Google or an Amazon or a Microsoft have got more money than many countries, so it would be very hard to take them on.
“These terms and conditions help to rebalance that.
“This isn’t necessarily about copyright, it’s about the breaking of terms and conditions.
“For us, it’s a case of capturing the infringements and then we’ll go after them, and that may be a case of going after the third-party scrapers.”
He said these companies “annoy the hell out of” him because “you’ve got these big AI companies not willing to entertain collective licensing agreements for people and yet they’re willing to buy the same people’s data through a third party who has scraped it.
“And it has created a billion-dollar industry, which is just madness. Because they could quite easily do the same deal directly with the publishers and the content creators.”
All options are currently on the table for Candr, including the possibility of invoicing a trillion-dollar US tech giant and, if necessary, seeking recovery of the debt at a local county court.
“We haven’t decided what we do with that right now. The whole thing at the moment is what we’ve been doing for years: capturing the abuse and the proof of the abuse.”
What is his advice to other publishers, who may be concerned about antagonising some of the most powerful companies in the world?
“There is no harm at all in implementing these terms and conditions, collecting the data, seeing who is infringing on them, and then deciding what to do.”
He added: “No one wants to jeopardise their search traffic and these terms and conditions don’t do that, they still allow the purpose of your content to be used for search.
“Any publisher that doesn’t have in-house legal counsel, I would say at least should take full advantage of the free terms and conditions and the tens of thousands of pounds worth of free legal aid that the Movement for an Open Web have created for you, right? I think you’ll be absolutely mad not to.”